
    Relay: A New IR for Machine Learning Frameworks

    Machine learning powers diverse services in industry, including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; to better accommodate them, we propose a new high-level intermediate representation (IR) called Relay. Relay is being designed as a purely functional, statically typed language with the goal of balancing efficient compilation, expressiveness, and portability. We discuss the goals of Relay and highlight its important design constraints. Our prototype is part of the open source NNVM compiler framework, which powers Amazon's deep learning framework MXNet.
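    As a rough illustration of what a purely functional, statically typed tensor IR involves, the sketch below models typed expression nodes and a tiny shape-checking pass in Python. The node and type names are hypothetical simplifications for exposition, not Relay's actual API.

```python
# A toy, purely functional, statically typed tensor IR in the spirit of Relay.
# Node and type names here are hypothetical simplifications, not Relay's API.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TensorType:
    shape: Tuple[int, ...]
    dtype: str = "float32"

@dataclass(frozen=True)
class Var:
    name: str
    ty: TensorType

@dataclass(frozen=True)
class Add:               # elementwise addition of two sub-expressions
    lhs: "Expr"
    rhs: "Expr"

Expr = (Var, Add)        # the expression "union" of this toy IR

def type_of(e) -> TensorType:
    """Infer the type of an expression, rejecting shape mismatches statically."""
    if not isinstance(e, Expr):
        raise TypeError(f"unknown expression node: {e!r}")
    if isinstance(e, Var):
        return e.ty
    lt, rt = type_of(e.lhs), type_of(e.rhs)
    if lt != rt:
        raise TypeError(f"shape/dtype mismatch: {lt} vs {rt}")
    return lt

# Usage: build a small dataflow expression and type-check it.
x = Var("x", TensorType((4, 4)))
y = Var("y", TensorType((4, 4)))
print(type_of(Add(x, y)))   # TensorType(shape=(4, 4), dtype='float32')
```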

    Compiler Support for Sparse Tensor Computations in MLIR

    Sparse tensors arise in problems in science, engineering, machine learning, and data analytics. Programs that operate on such tensors can exploit sparsity to reduce storage requirements and computational time. Developing and maintaining sparse software by hand, however, is a complex and error-prone task. We therefore propose treating sparsity as a property of tensors, rather than a tedious implementation task, and letting a sparse compiler generate sparse code automatically from a sparsity-agnostic definition of the computation. This paper discusses integrating this idea into MLIR.
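    A minimal sketch of the "sparsity as a property" idea: the computation (here a matrix-vector product) is written once in sparsity-agnostic form, and a sparsity-aware lowering iterates only the stored nonzeros of a CSR encoding. The helper names and CSR layout are illustrative and do not reflect MLIR's sparse_tensor dialect.

```python
# Sketch of "sparsity as a property": the computation y = A @ x is written once,
# agnostically; a sparsity-aware lowering iterates only the stored nonzeros.
import numpy as np

def spmv_dense(A, x):
    """Sparsity-agnostic definition: a plain matrix-vector product."""
    return A @ x

def to_csr(A):
    """Encode a dense matrix in compressed sparse row (CSR) form."""
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.nonzero(row)[0]
        indices.extend(nz)
        data.extend(row[nz])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def spmv_csr(indptr, indices, data, x, nrows):
    """What a sparse compiler would generate: loop over stored entries only."""
    y = np.zeros(nrows)
    for i in range(nrows):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

A = np.array([[0., 2., 0.], [0., 0., 0.], [3., 0., 4.]])
x = np.array([1., 1., 1.])
assert np.allclose(spmv_dense(A, x), spmv_csr(*to_csr(A), x, A.shape[0]))
```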

    Progressive Raising in Multi-level IR

    Multi-level intermediate representations (IRs) show great promise for lowering the design costs of domain-specific compilers by providing a reusable, extensible, and non-opinionated framework for expressing domain-specific and high-level abstractions directly in the IR. But while such frameworks support the progressive lowering of high-level representations to low-level IR, they offer no support for raising in the opposite direction. Thus, the entry point into the compilation pipeline defines the highest level of abstraction for all subsequent transformations, limiting the set of applicable optimizations, in particular for general-purpose languages that are not semantically rich enough to model the required abstractions. We propose Progressive Raising, a complementary approach to progressive lowering in multi-level IRs that raises from lower- to higher-level abstractions in order to leverage domain-specific transformations for low-level representations. We further introduce Multi-Level Tactics, our declarative approach to progressive raising, implemented on top of the MLIR framework, and demonstrate progressive raising from affine loop nests specified in a general-purpose language to high-level linear algebra operations. Our raising paths enable subsequent high-level domain-specific transformations with significant performance improvements.
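    A toy illustration of raising, assuming a heavily simplified IR encoding: the matcher below recognizes a triply nested affine accumulation loop whose body computes C[i,j] += A[i,k] * B[k,j] and raises it to a single high-level matmul operation. The data structures and pass names are hypothetical, not the Multi-Level Tactics or MLIR interfaces.

```python
# Toy "progressive raising": recognize an affine loop nest whose body is
# C[i,j] += A[i,k] * B[k,j] and raise it to a single high-level matmul op.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AffineLoopNest:
    loops: Tuple[str, ...]                   # induction variables, outermost first
    out: Tuple[str, Tuple[str, str]]         # (array, subscript) of the accumulator
    operands: Tuple[Tuple[str, Tuple[str, str]], ...]  # (array, subscript) of each factor

@dataclass
class MatmulOp:                              # the raised, domain-specific operation
    C: str
    A: str
    B: str

def raise_to_matmul(nest: AffineLoopNest):
    """Return a MatmulOp if the nest matches the matmul pattern, else None."""
    if len(nest.loops) != 3 or len(nest.operands) != 2:
        return None
    i, j, k = nest.loops
    (c, c_idx), ((a, a_idx), (b, b_idx)) = nest.out, nest.operands
    if c_idx == (i, j) and a_idx == (i, k) and b_idx == (k, j):
        return MatmulOp(C=c, A=a, B=b)
    return None

# The nest below encodes: for i, j, k: C[i,j] += A[i,k] * B[k,j]
nest = AffineLoopNest(loops=("i", "j", "k"),
                      out=("C", ("i", "j")),
                      operands=(("A", ("i", "k")), ("B", ("k", "j"))))
print(raise_to_matmul(nest))   # MatmulOp(C='C', A='A', B='B')
```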

    Automatic Parallelization and Locality Optimization of Beamforming Algorithms

    This paper demonstrates the benefits of a global optimization strategy, using a new automatic parallelization and locality optimization methodology for the high-performance embedded computing algorithms that occur in adaptive radar systems, targeting modern multi-core computing chips. As a baseline, the resulting performance is compared against the performance obtained with highly optimized math libraries.

    TC-CIM: Empowering Tensor Comprehensions for Computing-In-Memory

    Memristor-based, non-von-Neumann architectures that perform tensor operations directly in memory are a promising approach to address the ever-increasing demand for energy-efficient, high-throughput hardware accelerators for Machine Learning (ML) inference. A major challenge for the programmability and exploitation of such Computing-In-Memory (CIM) architectures is the efficient mapping of tensor operations from high-level ML frameworks to fixed-function hardware blocks implementing in-memory computations. We demonstrate the programmability of memristor-based accelerators with TC-CIM, a fully automatic, end-to-end compilation flow from Tensor Comprehensions, a mathematical notation for tensor operations, to fixed-function memristor-based hardware blocks. Operations suitable for acceleration are identified using Loop Tactics, a declarative framework for describing computational patterns in a polyhedral representation. We evaluate our compilation flow on a system-level simulator based on gem5, incorporating crossbar arrays of memristive devices. Our results show that TC-CIM reliably recognizes tensor operations commonly used in ML workloads across multiple benchmarks and offloads these operations to the accelerator.
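    A sketch of the offloading decision in a TC-CIM-style flow, under assumed parameters: operations that match a supported pattern and fit the crossbar dimensions are routed to a stand-in accelerator call, and everything else falls back to the CPU. The cim_matvec interface and the 128x128 crossbar size are hypothetical, not part of the described system.

```python
# Sketch of the offload decision: matrix-vector products that fit one crossbar
# tile go to the (mock) in-memory accelerator; larger operands stay on the CPU.
import numpy as np

CROSSBAR_DIM = 128   # assumed rows x columns of one memristive crossbar array

def cim_matvec(weights, x):
    """Stand-in for a fixed-function in-memory matrix-vector multiply."""
    return weights @ x          # a real flow would program the crossbar instead

def cpu_matvec(weights, x):
    """Fallback path executed on the host CPU."""
    return weights @ x

def dispatch_matvec(weights, x):
    """Offload when the operand tile fits the crossbar; otherwise run on the CPU."""
    rows, cols = weights.shape
    if rows <= CROSSBAR_DIM and cols <= CROSSBAR_DIM:
        return cim_matvec(weights, x)
    return cpu_matvec(weights, x)

W = np.random.rand(64, 64)      # fits a single crossbar tile -> offloaded
v = np.random.rand(64)
y = dispatch_matvec(W, v)       # same result either way; only the target differs
```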

    Scalable Optimization Methods in the Polyhedral Model (Méthodes d'optimisation scalables dans le modèle polyédrique)

    Limited by ever-increasing power consumption and control complexity, current processors have evolved into multi-core architectures with increasingly many cores per chip and multiple threads per core. Compilers are responsible for translating the idealistic operational semantics of the source program into a form that makes efficient use of such highly complex, heterogeneous machines. The polyhedral model is a sound mathematical abstraction for representing programs with affine control loops, and it alleviates many of the trade-offs hampering current optimizing compilers. In the first part of this thesis, we discuss the problems faced by compilers; abstracting from these, we show how a semi-automatic approach relieves the user from producing complex code for long sequences of transformations. In the second part, we address the scheduling problem and give a new formulation that expresses the set of all legal multidimensional schedules as a single optimization problem. We take a fresh look at dependence analysis and program legality, which allows us to devise an automatic process for correcting illegal sequences of complex transformations. In the third part, we discuss code generation issues and build the first fully reentrant code generation framework based on the polyhedral model. We introduce the concepts of code-generation-time transformations and schedule equivalence to tailor syntactic optimizations at the finest possible grain. All contributions have been implemented either in the URUK framework on top of the Open64 compiler or as part of IBM's XLC compiler.
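    As a minimal worked example of the central object in this setting, the sketch below encodes a multidimensional affine schedule as a matrix applied to iteration vectors and shows how a loop interchange is simply a different schedule matrix. It only illustrates the formalism; it is not the thesis's scheduling formulation.

```python
# Minimal illustration of a multidimensional affine schedule in the polyhedral
# model: each iteration (i, j) of a loop nest is mapped to a logical time stamp
# theta(i, j) = T @ (i, j). Choosing T subject to dependences is the scheduling
# problem; here we only compare two fixed choices of T.
import numpy as np

identity = np.array([[1, 0],
                     [0, 1]])        # original loop order: time = (i, j)
interchange = np.array([[0, 1],
                        [1, 0]])     # loop interchange: time = (j, i)

def schedule(T, iterations):
    """Apply schedule matrix T to every iteration vector and sort by time."""
    timed = [(tuple(T @ np.array(it)), it) for it in iterations]
    return [it for _, it in sorted(timed)]

# A 3x3 iteration domain {(i, j) | 0 <= i, j < 3}.
domain = [(i, j) for i in range(3) for j in range(3)]
print(schedule(identity, domain)[:4])     # row-major execution order
print(schedule(interchange, domain)[:4])  # column-major execution order
```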

    Automatic correction of loop transformations

    Loop nest optimization is a combinatorial problem. Due to the growing complexity of modern architectures, it involves two increasingly difficult tasks: (1) analyzing the profitability of sequences of transformations to enhance parallelism, locality, and resource usage, which amounts to a hard problem on a non-linear objective function; and (2) constructing and exploring a search space of legal transformation sequences. Practical optimizing and parallelizing compilers decouple these tasks, resorting to a predefined set of enabling transformations to eliminate all sorts of semantic constraints that limit optimization. State-of-the-art optimization heuristics thus face a hard decision problem on the selection of enabling transformations that is only remotely related to performance. We propose a new design where optimization heuristics first address the main performance anomalies and then correct potentially illegal loop transformations a posteriori, attempting to minimize the performance impact of the necessary adjustments. We propose a general method to correct any sequence of loop transformations through a combination of loop shifting, code motion, and index-set splitting. Sequences of transformations are modeled by compositions of geometric transformations on multidimensional affine schedules. We provide experimental evidence of the scalability of the algorithms on real loop optimizations.
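    The sketch below illustrates a posteriori correction by loop shifting on a one-dimensional example with two statements: if a dependence source is not scheduled strictly before its target, the target is delayed by whole iterations until the dependence is satisfied. The encoding is a hypothetical simplification, not the paper's algorithm or the URUK implementation.

```python
# Correction by loop shifting: schedules map each statement instance to a
# lexicographic time stamp (iteration + shift, position within the iteration);
# the dependence target is shifted until it runs after its source.

def time(stmt, i):
    """Time stamp of statement instance stmt(i)."""
    return (i + stmt["shift"], stmt["pos"])

def violated(dep, stmts, n=10):
    """Check the dependence src(i) -> dst(i) over iterations 0..n-1."""
    src, dst = stmts[dep["src"]], stmts[dep["dst"]]
    return any(time(src, i) >= time(dst, i) for i in range(n))

def correct_by_shifting(dep, stmts):
    """Shift the dependence target later until the dependence holds."""
    while violated(dep, stmts):
        stmts[dep["dst"]]["shift"] += 1
    return stmts

# S1 writes a[i]; S2 reads a[i]. An illegal transformation placed S2 before S1
# within each iteration (positions swapped).
stmts = {"S1": {"shift": 0, "pos": 1}, "S2": {"shift": 0, "pos": 0}}
dep = {"src": "S1", "dst": "S2"}
print(violated(dep, stmts))    # True: the transformation is illegal
correct_by_shifting(dep, stmts)
print(stmts["S2"])             # {'shift': 1, 'pos': 0} -> S2 delayed by one iteration
print(violated(dep, stmts))    # False: the dependence is now satisfied
```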

    Violated dependence analysis

    The polyhedral model is a powerful framework for reasoning about high-level loop transformations, yet the lack of scalable algorithms and tools has deterred actors from both academia and industry from putting this model to practical use. Indeed, for fundamental complexity reasons, its applicability has long been limited to simple kernels. Recent developments broke some generally accepted ideas about these limitations. In particular, new algorithms made it possible to compute the target code for full SPEC benchmarks, even though this code generation step had been expected not to scale. Instancewise array dependence analysis computes a finite, intensional representation of the (statically unbounded) set of all dynamic dependences. This problem has always been considered non-scalable and/or overkill with respect to less expressive and faster dependence tests. On the contrary, this article presents experimental evidence of its applicability to the full SPEC CPU2000 benchmarks. To make this possible, we revisit the characterization of data dependences, considering relations between time dimensions of the transformed space. Beyond algorithmic benefits, this naturally leads to a novel way of reasoning about violated dependences across arbitrary transformation sequences. Reasoning about violated dependences relieves the compiler designer from the cumbersome task of implementing specific legality checks for each single transformation. In the case of invalid transformations, it also makes it possible to determine precisely the violated dependences that need to be corrected. Identifying these violations can in turn enable automatic correction schemes that fix an illegal transformation sequence with minimal changes.
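    A small, instancewise illustration of checking for violated dependences: given the dependence instances of a simple flow dependence and a candidate transformation (here, loop reversal), the check below lists exactly which instance pairs end up scheduled in the wrong order. The example and helper names are illustrative, not the paper's analysis.

```python
# Instancewise check of violated dependences: list exactly which dependence
# instances are scheduled in the wrong order under a candidate transformation.

N = 6
# Flow dependence: a[i] written at iteration i is read at iteration i + 1.
dependences = [(i, i + 1) for i in range(N - 1)]

def original_time(i):
    return i                      # identity schedule: execute in source order

def reversed_time(i):
    return N - 1 - i              # candidate transformation: loop reversal

def violated_instances(schedule, deps):
    """Instances (src, dst) whose source is no longer scheduled before its target."""
    return [(s, d) for s, d in deps if schedule(s) >= schedule(d)]

print(violated_instances(original_time, dependences))  # [] -> legal
print(violated_instances(reversed_time, dependences))  # every instance -> reversal is illegal
```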